Abstract 

Background: Evolutionary methods are increasingly challenged by the wealth of fast-growing resources of 
genomic sequence information. Evolutionary events, like gene duplication, loss, and deep coalescence, account 
more than ever for incongruence between gene trees and the actual species tree. Gene tree reconciliation 
addresses this fundamental problem by invoking the minimum number of gene duplications and losses that 
reconcile a rooted gene tree with a rooted species tree. However, the reconciliation process is highly sensitive to 
topological error or wrong rooting of the gene tree, and most gene trees in practice are not free of such errors. 
Thus, despite the promises of gene tree reconciliation, its applicability in practice is severely limited. 

Results: We introduce the problem of reconciling unrooted and erroneous gene trees by simultaneously rooting 
and error-correcting them, and describe an efficient algorithm for this problem. Moreover, we introduce an 
error-corrected version of the gene duplication problem, a standard application of gene tree reconciliation. We 
introduce an effective heuristic for our error-corrected version of the gene duplication problem, given that the 
original version of this problem is NP-hard. Our experimental results suggest that our error-correcting approaches 
for unrooted input trees can significantly improve the accuracy of gene tree reconciliation and of species tree 
inference under the gene duplication problem. Furthermore, our algorithm for error-correcting 
reconciliation is efficient enough to handle truly large-scale phylogenetic studies. 

Conclusions: Our error-correction approach is a crucial step towards making gene tree reconciliation 
more robust, and thus towards improving the accuracy of applications that fundamentally rely on gene tree 
reconciliation, like the inference of gene-duplication supertrees. 



Background 

The wealth of newly sequenced genomes has provided us 
with an unprecedented resource of information for phyloge- 
netic studies that will have extensive implications for a host 
of issues in biology, ecology, and medicine, and promise 
even more. Yet, before such phylogenies can be reliably 
inferred, challenging problems that came along with the 
newly sequenced genomes have to be overcome. Evolution- 
ary biologists have long realized that gene-duplication and 
subsequent loss, a fundamental evolutionary process [1], 
can largely obfuscate phylogenetic inference [2]. Gene- 
duplication can form complex evolutionary histories of 
genes, called gene trees, whose topologies are traditionally 
used to derive species trees. This approach relies on the 
assumption that the topologies from gene trees are consis- 
tent with the topology of the species tree. However, fre- 
quently genes that evolve from different copies of ancestral 
gene-duplications can become extinct and result in gene 
trees with correct topologies that are inconsistent with the 
topology of the actual species tree (see Figure 1). In many 
such cases phylogenetic information from the gene trees is 
indispensable and may still be recovered using gene tree 
reconciliation. 
Figure 1 Rooted reconciliation. An lca-mapping M from the gene tree Q into the species tree S and the corresponding embedding. M is 
shown for the internal nodes of Q. 





Related work 

Gene tree reconciliation is a well-studied method for 
resolving topological inconsistencies between a gene tree 
and a trusted species tree [2-7]. Inconsistencies are 
resolved by invoking gene-duplication and loss events 
that reconcile the gene tree to be consistent with the 
actual species tree. Such events not only reconcile 
gene trees, but also lay the foundation for a variety of evolutionary 
applications including ortholog/paralog annotation 
of genes, locating episodes of gene-duplications in 
species trees [8-10], reconstructing domain decompositions 
[11], and species supertree construction [8,12-14]. 

A major problem in the application of gene tree recon- 
ciliation is its high sensitivity to error-prone gene trees. 
Even seemingly insignificant errors can largely mislead 
the reconciliation process and, typically undetected, infer 
incorrect phylogenies (e.g., [7,15]). Errors in gene trees 
are often topological errors and rooting errors. Topologi- 
cal error results in an incorrect topology of the gene tree 
that can be caused by the inference process (e.g. noise in 
the underlying sequence data) or the inference method 
itself (e.g. heuristic results). This problem has been 
addressed for rooted gene trees by 'correcting the error'; 
that is, editing the given tree such that the number of 
invoked gene-duplications and losses is minimized 
[16,17]. However, most inference methods used in prac- 
tice return only unrooted gene trees (e.g. parsimony and 
maximum likelihood based methods) that have to be 
rooted for the gene tree reconciliation process. Rooting 
error is a wrongly chosen root in an unrooted gene tree. 
Whereas rooting can be typically achieved in species 
trees by outgroup analysis, this approach may not be pos- 
sible for gene trees if there is a history of gene duplica- 
tion and loss [7]. Other rooting approaches like midpoint 
rooting or molecular clock rooting assume a constant 
rate of evolution that is often unrealistic. However, root- 
ing problems can be bypassed by identifying roots that 
minimize the invoked number of gene duplications and 
losses [7,16-19]. 

In summary, even a small topological error or a slightly 
misplaced root can lead to the inference of enormous numbers 
of spurious gene duplications and losses, and therefore largely 
mislead the reconciliation process. Therefore, gene tree reconciliation 
requires gene trees that are free of error and 
correctly rooted at the same time [5]. However, as previous 
work has incorporated topological error-correction 
only separately from correctly rooting gene trees into the 
reconciliation process [16,18], this process can still be 
misled. 

Our contribution 

We address the problem of reconciling erroneous and 
unrooted gene trees by error-correcting and rooting 
them at the same time. Solving this problem efficiently is 
a crucial step towards making gene tree reconciliation 
more robust, and thus towards improving the accuracy of 
applications that rely on gene tree reconciliation, like the 
construction of gene-duplication supertrees. We introduce 
the problem and design an efficient algorithm that 
facilitates a much more precise gene tree reconciliation, 
even for large-scale data sets. Our algorithm detects and 
corrects errors in unrooted gene trees, and thus spares 
biologists the difficulty and uncertainty of handling erroneous 
gene trees and correctly rooting them. The presented 
experimental results suggest that our novel 
reconciliation algorithms can identify and correct topological 
error in unrooted input gene trees, and at the same 
time root them optimally. 

Our algorithm is designed to search for the correct and 
rooted tree of a given unrooted tree in local search neighborhoods 
of the given tree. The size of these neighborhoods 
is described by a positive integer k that allows the search 
to be fine-tuned. While in theory k can be large, it is 
assumed that gene trees have only small topological 
error, which typically can be captured by small values of 
k. For a fixed but freely choosable integer k the runtime 
of our algorithm is O(l^k + max(n, m)), where n and m are 
the sizes of the gene tree and species tree respectively, and 
l is the number of edges in the gene tree that potentially 
contain an error (such edges will be called weak). Thus, 
for a small error, which is expressed by k = 1, our algorithm 
runs in linear time. Our experiments show that 
error-correction runs of the algorithm for k = 3 are still 
possible even for trees with a large number of weak edges 
(e.g., l = 200) on a standard workstation configuration. 

Further, we address the problem of constructing 
rooted supertrees by reconciling unrooted and erro- 
neous gene trees with assigned weak edges, a key 
problem in illuminating the role and effect of gene 
duplication and loss in shaping the evolution of organ- 
isms. We introduce the problem and develop an effec- 
tive local search heuristic that makes the construction of 
more accurate supertrees possible and allows a much 
better postulation of gene duplication histories. Our 
experimental results demonstrate that our approach is 
effective in identifying gene duplication histories given 
erroneous gene trees and producing more accurate 
supertrees under gene tree reconciliation. 

Duplication-loss model 

We introduce the fundamentals of the classical duplica- 
tion-loss model. Our definitions are mostly adopted 
from [18]. For a more detailed introduction to the dupli- 
cation-loss model we refer the interested reader to 
[2,5,10,20]. 

Let g be the set of species consisting of N > 0 elements. 
The unrooted gene tree is an undirected acyclic 
graph in which each node has degree 3 (internal nodes) 
or 1 (leaves), and the leaves are labeled by the elements 
from g. A species tree S is a rooted binary tree with N 
leaves uniquely labeled by the elements from g. In some 
cases, a node of a tree will be referred to by the "cluster" of 
labels of its subtree leaves. For instance, a species tree 
(a, (b, c)) has 5 nodes denoted by: a, b, c, bc and abc. A 
rooted gene tree is a rooted binary tree with leaves 
labeled by the elements from g. We denote the internal nodes of a 
tree T by int(T). 

Let S = (V_S, E_S) be a species tree. S can be viewed as 
an upper semilattice with + a binary least upper bound 
operation and ⊤ the top element, that is, the root. In 
particular for a, b ∈ V_S, a < b means that a and b are 
on the same path from the root, with b being closer to 
the root than a. We define the comparability predicate 
D(a, b) = 1, if a ≤ b or b ≤ a, and D(a, b) = 0, when a 
and b are incomparable. The distance function ρ(a, b) is 
used to denote the number of edges on the unique 
(non-directed) path connecting a and b. 

We call distinct nodes a, b ∈ V_S siblings when a + b 
is a parent of a and b. For a, b ∈ V_S let Sb(a, b) be the 
set of nodes defined by the following recurrent rule: (i) 
Sb(a, b) = ∅ if a = b or a and b are siblings, (ii) Sb(a, 
b) = {c} ∪ Sb(a + c, b), if a < b or a + c < a + b; here c is 
the sibling of a, and (iii) Sb(a, b) = Sb(b, a) otherwise. 

By L(a, b) we denote the number of elements in Sb(a, 
b). Observe that L(a, b) = ρ(a, b) - 2 · (1 - D(a, b)). Let 
M : V_Q → V_S be the least common ancestor (lca) mapping 
from rooted Q into S that preserves the labeling 
of the leaves. Formally, if v is a leaf in Q then M(v) is 
the node in S labeled by the label of v. If v is an internal 
node in Q with two children a, b, then M(v) = M(a) + 
M(b). An example is depicted in Figure 1. 
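
As an illustration, the definitions of +, D, ρ, Sb and L can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the species tree ((a, b), (c, d)), the child-to-parent dictionary PARENT and all function names are hypothetical choices made for this example, and the lca is found by a naive walk to the root rather than by the constant-time structure of [21].

```python
# Toy species tree ((a, b), (c, d)) stored as a child -> parent map.
# The tree and all names below are illustrative assumptions.
PARENT = {"a": "ab", "b": "ab", "c": "cd", "d": "cd",
          "ab": "abcd", "cd": "abcd", "abcd": None}

def ancestors(x):
    """Nodes on the path from x up to the root, x included."""
    path = []
    while x is not None:
        path.append(x)
        x = PARENT[x]
    return path

def lca(a, b):
    """The '+' operation: least upper bound of a and b in S."""
    on_path = set(ancestors(a))
    return next(x for x in ancestors(b) if x in on_path)

def below(a, b):
    """Strict order a < b: b is a proper ancestor of a."""
    return a != b and lca(a, b) == b

def D(a, b):
    """Comparability predicate: 1 if a and b lie on one root path."""
    return 1 if lca(a, b) in (a, b) else 0

def rho(a, b):
    """Number of edges on the path connecting a and b."""
    top = lca(a, b)
    return ancestors(a).index(top) + ancestors(b).index(top)

def sibling(a):
    """The other child of the parent of a (a must not be the root)."""
    return next(x for x in PARENT if x != a and PARENT[x] == PARENT[a])

def Sb(a, b):
    """Set Sb(a, b) computed by rules (i)-(iii)."""
    siblings = a != b and PARENT[a] is not None and PARENT[a] == PARENT[b]
    if a == b or siblings:                                   # rule (i)
        return set()
    if PARENT[a] is not None and (below(a, b) or below(PARENT[a], lca(a, b))):
        return {sibling(a)} | Sb(PARENT[a], b)               # rule (ii)
    return Sb(b, a)                                          # rule (iii)

def L(a, b):
    """Number of elements of Sb(a, b)."""
    return len(Sb(a, b))
```

On this toy tree, Sb(a, c) consists of the siblings b and d, and the identity L(a, b) = ρ(a, b) - 2 · (1 - D(a, b)) can be checked for every pair of nodes.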



In this general setting let us assume that we are given a 
cost function ξ : V_Q × V_S → R which for all nodes 
v ∈ V_Q, a ∈ V_S assigns a real ξ(v, a) representing a contribution 
to node a which comes from v when reconciling Q 
with S. Having ξ we can define κ(v) = Σ_{a ∈ V_S} ξ(v, a) to be 
a total contribution from v in the reconciliation of Q with 
S. We call κ a contribution function. Finally, σ = Σ_{v ∈ V_Q} κ(v) 
is the total cost of reconciliation of Q with S. 

Now we present examples of cost functions that are 
used in the duplication model. We assume that if v is 
an internal node in Q then w_1 and w_2 are its children. 
The Duplication cost function is defined as follows: ξ^D(v, 
a) = 1 if v ∈ int(Q) and M(v) = M(w_i) = a for some i, 
and ξ^D(v, a) = 0 otherwise. The Loss cost function: ξ^L(v, 
a) = 1 if v ∈ int(Q) and a ∈ Sb(M(w_1), M(w_2)), and ξ^L(v, 
a) = 0 otherwise. It can be proved that if v ∈ int(Q) 
then κ^D(v) = D(M(w_1), M(w_2)) and κ^L(v) = L(M(w_1), 
M(w_2)) (in both cases 0 if v is a leaf). 

Observe that a node v ∈ V_Q is called a duplication 
[4,13] if κ^D(v) = 1. Moreover, κ^L(v) = l(v), where l(v) is 
the number of gene losses associated to v. It can be 
proved that σ^D and σ^L are the minimal number of gene 
duplications and gene losses (respectively) required to 
reconcile (or to embed) Q with S. Please refer to [18] 
for more details. The example of an embedding is 
depicted in Figure 1. 
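
The closed forms κ^D(v) = D(M(w_1), M(w_2)) and κ^L(v) = L(M(w_1), M(w_2)) suggest a simple bottom-up computation of the duplication and loss costs. The following is a minimal sketch, not the authors' implementation: it assumes a toy species tree ((a, b), c), represents a rooted gene tree as nested pairs of species labels, and uses a naive lca; the names PARENT and reconcile are hypothetical.

```python
# Toy species tree ((a, b), c) as a child -> parent map (illustrative assumption).
PARENT = {"a": "ab", "b": "ab", "ab": "abc", "c": "abc", "abc": None}

def ancestors(x):
    """Nodes on the path from x up to the root, x included."""
    path = []
    while x is not None:
        path.append(x)
        x = PARENT[x]
    return path

def lca(a, b):
    """The '+' operation in the species tree (naive, not constant time)."""
    on_path = set(ancestors(a))
    return next(x for x in ancestors(b) if x in on_path)

def reconcile(tree):
    """Return (M(root), duplications, losses) for a rooted gene tree.

    A leaf is a species label; an internal node is a pair (left, right).
    """
    if isinstance(tree, str):
        return tree, 0, 0
    (m1, d1, l1), (m2, d2, l2) = reconcile(tree[0]), reconcile(tree[1])
    top = lca(m1, m2)                       # M(v) = M(w1) + M(w2)
    dup = 1 if top in (m1, m2) else 0       # kappa^D(v) = D(M(w1), M(w2))
    rho = ancestors(m1).index(top) + ancestors(m2).index(top)
    loss = rho - 2 * (1 - dup)              # kappa^L(v) = L(M(w1), M(w2))
    return top, d1 + d2 + dup, l1 + l2 + loss
```

In this sketch a gene tree congruent with the species tree, ((a, b), c), needs no events, whereas ((a, c), b) is reconciled with one duplication and three losses.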

Introduction to unrooted reconciliation 

Here we highlight some results from [18] that are used 
for the design of our algorithm. From now on, we assume 
that Q = (V_Q, E_Q) is an unrooted gene tree. We define a 
rooting of Q by selecting an edge e ∈ E_Q on which the 
root is to be placed. Such a rooted tree will be denoted 
by Q_e, where v* is a new node defining the root. To distinguish 
between rootings of Q, the symbols defined in the 
previous section for rooted gene trees will be extended 
by inserting index e. Please observe that the mapping of 
the root of Q_e is independent of e. Without loss of generality 
the following is assumed: (A1) S and Q have at 
least one internal node and (A2) M_e(v*) = ⊤; that is, the 
root of every rooting is mapped into the root of S (we 
may always consider the subtree of the species tree 
rooted in M_e(v*) with no change of the cost). 

First, we transform Q into a directed graph 
Q̂ = (V_Q, Ê_Q) where Ê_Q = {(v, w) | {v, w} ∈ E_Q}. In other 
words each edge {v, w} in Q is replaced in Q̂ by a pair 
of directed edges (v, w) and (w, v). 

Edges in Q̂ are labeled by nodes of S as follows. If 
v ∈ V_Q is a leaf labeled by a, then the edge (v, w) ∈ Ê_Q 
is labeled by a. When v is an internal node in Q we 
assume that (w_1, v) and (w_2, v) are labeled by b_1 and b_2, 
respectively. Then the edge (v, w_3) ∈ Ê_Q, such that w_3 ≠ 
w_1 and w_3 ≠ w_2, is labeled by b_1 + b_2. Such labeling will 
be used to explore mappings of rootings of Q. An edge 
{v, w} in Q is called asymmetric if exactly one of the 
labels of (v, w) and (w, v) in Q̂ is equal to ⊤, otherwise 
it is called symmetric. 
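
The labeling of directed edges can be illustrated by a short recursive sketch. All data below are assumptions made for the example: the species tree ((a, b), (c, d)), an unrooted gene tree with internal nodes u, v and leaves g1, ..., g4, and the names ADJ, SPECIES, label and is_symmetric. The recursion recomputes labels on every call, so it does not achieve the linear running time discussed in the text.

```python
# Toy species tree ((a, b), (c, d)); TOP is its root (illustrative assumption).
PARENT = {"a": "ab", "b": "ab", "c": "cd", "d": "cd",
          "ab": "abcd", "cd": "abcd", "abcd": None}
TOP = "abcd"

# Unrooted gene tree: leaves g1, g2 hang on u; leaves g3, g4 hang on v.
ADJ = {"u": ["g1", "g2", "v"], "v": ["g3", "g4", "u"],
       "g1": ["u"], "g2": ["u"], "g3": ["v"], "g4": ["v"]}
SPECIES = {"g1": "a", "g2": "b", "g3": "c", "g4": "d"}
EDGES = [("g1", "u"), ("g2", "u"), ("u", "v"), ("g3", "v"), ("g4", "v")]

def ancestors(x):
    """Nodes on the path from x up to the root, x included."""
    path = []
    while x is not None:
        path.append(x)
        x = PARENT[x]
    return path

def lca(a, b):
    """The '+' operation in the species tree (naive version)."""
    on_path = set(ancestors(a))
    return next(x for x in ancestors(b) if x in on_path)

def label(v, w):
    """Label of the directed edge (v, w)."""
    if v in SPECIES:                        # v is a leaf
        return SPECIES[v]
    w1, w2 = (n for n in ADJ[v] if n != w)  # the two other neighbors of v
    return lca(label(w1, v), label(w2, v))  # b1 + b2

def is_symmetric(v, w):
    """Asymmetric means exactly one of the two labels equals the top element."""
    return (label(v, w) == TOP) == (label(w, v) == TOP)

symmetric_edges = [e for e in EDGES if is_symmetric(*e)]
```

In this example the center edge {u, v} carries the labels ab and cd and is the only symmetric edge; every leaf edge has exactly one direction labeled by ⊤.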

Every internal node v, and its neighbors in Q define a 
subtree of Q̂, called a star with a center v, as depicted 
in Figure 2. The edges (v, w_i) are called outgoing, while 
the edges (w_i, v) are called incoming. We will refer to 
the undirected edge {v, w_i} as e_i, for i = 1, 2, 3. 

There are several types of possible star topologies based 
on the labeling (for proofs and details see [18]): (S1) a 
star has one incoming edge labeled by ⊤ and two outgoing 
edges labeled ⊤ and these edges are connected to 
the three siblings of the center, (S2) a star has exactly 
two outgoing edges labeled by ⊤, (S3) a star has all outgoing 
edges and exactly one incoming edge labeled by ⊤, 
(S4) a star has all edges labeled by ⊤, and (S5) a star 
has all outgoing edges and exactly two incoming edges 
labeled by ⊤. Figure 2 illustrates the star topologies. 

In summary, stars are basic 'puzzle-like' units that can 
be assembled into unrooted gene trees. 
However, not all star compositions represent a gene 
tree. For instance, there is no gene tree with 3 stars of 
type S2. It follows from [18] (see Lemma 4) that we 
need the following additional condition: (C1) if a gene 
tree has two stars of type S2 then they share a common 
edge. 

Now we overview the main result of [18] (see Theorem 
1 for more details). Let S be a species tree and Q 
be an unrooted gene tree. The set of optimal edges, that is, 
candidates for best rootings, is defined as follows: 

Min_Q = {e ∈ E_Q | σ^{α,β}_e is minimal}, where σ^{α,β}_e is the 
total cost for the weighted mutation cost defined by 

ξ^{α,β}(v, a) = α · ξ^D(v, a) + β · ξ^L(v, a), 

e is an edge in Q and α, β are two positive reals. Then 
(M1) if |Min_Q| > 1, then Min_Q consists of all edges 
present in all stars of type S4 or S5, (M2) if |Min_Q| = 1, 
then Min_Q contains exactly one symmetric edge that is 
present in a star of type S2 or S3. From the above statements, 
(C1) and star topologies we can easily determine 
Min_Q. More precisely, the star edges outside Min_Q are 
asymmetric and share the same direction. Thus, to find 
an optimal edge it is sufficient to follow the direction of 
non-⊤ edges in Q̂. 

Now we summarize the time complexity of this procedure. 
It follows from [21] that a single lca-query (that is, 
a + b for nodes a and b in S) can be computed in constant 
time after an initial preprocessing step requiring 
O(|S|) time. Other structures like Q̂ with the labeling 
can be computed in O(|Q|) time. The procedure of 
finding an optimal edge in Q has the same complexity. In 
summary an optimal edge/rooting and the minimal cost 
can be computed in linear time. See [18] for more 
details and other properties. 

Methods 

First we describe our algorithm for computing the opti- 
mal cost and the set of optimal edges after one nearest 
neighbor interchange (NNI) operation performed on an 
unrooted gene tree, and then extend it to a general case 
with k NNI operations. For the definition of NNI please 
refer to Def. 1 and Figure 3. 

Algorithm 

Now we show that a single NNI operation can be completed 
in constant time if all structures required for 
computing the optimal rootings are already constructed. 
First, let us assume that the following is given: (a) two 
positive reals α and β, a species tree S, (b) an lca structure 
for S that allows lca-queries to be answered in constant 
time, (c) an unrooted gene tree Q, (d) Q̂ with the labeling 
of edges, (e) Min_Q - the set of optimal edges, and (f) 
σ - the minimal total weighted mutation cost. As 
observed in the previous section (b),(d)-(f) can be computed 
in O(max(|S|, |Q|)) time. Now we show that (c)-(f) 
can be computed in constant time after a single NNI 
operation. 

Figure 2 Unrooted reconciliation. a) A star in Q. b) Types of edges. c) All possible types of stars. We use simplified notation instead of the 
full topology. 

Figure 3 NNI. A single NNI on Q and Q̂. On the left e_i and e'_i (for i = 0, ..., 4) denote edges in Q and its NNI-neighbor Q', respectively. On 
the right each a_i denotes the labeling of edges in Q̂. Notation ā_i denotes the lca-mapping of complementary subtrees, for instance, 
ā_3 = a_1 + a_2 + a_4, etc. For brevity, we omit each subtree T_i attached to w_i in the left diagram. 

NNI operation (c) and the update of lca-mappings (d). 

Definition 1. (Single NNI operation) An NNI operation 
transforms a gene tree Q = ((T_1, T_2), (T_3, T_4)) into 
Q' = ((T_2, T_3), (T_1, T_4)), where the T_i are (rooted) subtrees 
of Q. The edge that connects the roots of (T_1, T_2) and 
(T_3, T_4) in Q is denoted by e_0 and called the center 
edge. For each i = 1, 2, 3, 4 we assume the following: w_i 
is the root of T_i, e_i is the edge connecting w_i with e_0 and 
a_i is the lca-mapping of T_i. Similarly, we define the center 
edge e'_0 and e'_i in Q'. 

An NNI operation is depicted in Figure 3 with the 
transformation of Q into Q'. The notation will be used 
from now on. Note that there is a second NNI operation, 
when Q is replaced with ((T_1, T_3), (T_2, T_4)). However, 
it can be easily defined and therefore it is omitted 
here. Observe that the NNI operation (without updating 
of lca-mappings) can be performed in constant time for 
both trees. 
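
For illustration, both NNI operations across the center edge can be written down directly on a nested-pair representation ((T_1, T_2), (T_3, T_4)). This is only a sketch of Definition 1 on a hypothetical data structure; the function name nni_neighbors is an assumption, and the subtrees T_i may be leaves or arbitrary nested pairs.

```python
def nni_neighbors(tree):
    """Return both NNI neighbors of ((T1, T2), (T3, T4)) across the center edge."""
    (t1, t2), (t3, t4) = tree
    first = ((t2, t3), (t1, t4))   # the operation of Definition 1
    second = ((t1, t3), (t2, t4))  # the second NNI operation
    return first, second


# Example with T4 being a cherry (d, e); the subtrees themselves are untouched.
first, second = nni_neighbors((("a", "b"), ("c", ("d", "e"))))
```

Only the grouping around the center edge changes, which matches the observation that the operation itself takes constant time.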

The right part of Figure 3 depicts the transformation 
of Q̂. Observe that the labels of the incoming and outgoing 
edges attached to each w_i in Q̂ do not change 
during this operation. Lemma 1 follows directly from 
this observation. 

Lemma 1. An NNI operation changes only the labels 
of the center edge. 

We conclude that updating Q̂ requires only two lca-queries, 
and therefore can be performed in constant 
time. 

Reconstruction of optimal edges (e). We analyze the 
changes of the optimal set of edges Min_Q. To this end 
we consider a number of cases depending on the relation 
between the optimal set of edges and the set of 
edges incident to the nodes of the center edge. Let 
C_Q = {e_i}_{i=0,...,4}. 

For convenience, assume that the NNI operation 
replaces e_i with e'_i as indicated in Figure 3. We call two 
disjoint edges from C_Q semi-alternating if they share a 
common node after the NNI operation. In Figure 3 {e_1, 
e_4} and {e_2, e_3} are semi-alternating. For two edges a 
and b that are incident to the same node let ★(a, b) be 
the set of three edges defining the unique star that contains 
a and b. 

Lemma 2. Assuming that e_i is replaced by e'_i after the 
NNI operation, the set of optimal edges does not require 
additional changes if and only if one of the following 
conditions is satisfied: (EQ1) Min_Q ∩ C_Q = ∅, 

(EQ2) Min_Q ⊇ C_Q and each pair of semi-alternating 
edges contains at least one symmetric edge, 

(EQ3) Min_Q consists of only the center edge, 

(EQ4) Min_Q ∩ C_Q = {e_i} for some i > 0 and the center 
is asymmetric after the NNI operation. 

Proof: (EQ1) All edges in C_Q are asymmetric (2 stars 
S1). Then, after the NNI operation e'_0 is asymmetric 
and C_{Q'} has 2 stars S1. (EQ2) C_Q consists of 2 stars 
of type S4/S5 and at most two asymmetric edges. It follows 
from EQ2 that the asymmetric edges in C_{Q'} cannot 
form a star of type other than S5. Together with M1 it 
follows that C_{Q'} is optimal. (EQ3) By M1 the center is 
symmetric in Q. It remains symmetric after NNI. From 
C1 and M2, Min_{Q'} consists of the center edge. (EQ4) 
Note that the type of ★(e'_i, e'_0) is S1, S2 or S3. 

Lemma 3 (NE1). If Min_Q ⊇ C_Q and there exists a 
pair {e_i, e_j} of asymmetric semi-alternating edges, then 
Min_{Q'} = Min_Q \ C_Q ∪ (C_{Q'} \ {e'_i, e'_j}). 

Proof: The type of ★(e'_i, e'_j) is S1 or S3 and the other 
star has type S4 or S5. By M2 e'_i and e'_j are not optimal. 

Lemma 4 (NE2). If Min_Q ∩ C_Q = {e_i} for some i > 0 
and the center is symmetric after the NNI operation then 
Min_{Q'} = Min_Q \ {e_i} ∪ ★(e'_0, e'_i). 

Proof: In this case e'_0 has two arrows and ★(e'_0, e'_i) is 
of type S5. 

Lemma 5. Assume that Min_Q ∩ C_Q = {e_0, e_i, e_j}, where 
i ≠ 0, 

(NE3) If both e_i and e_j are symmetric then 
Min_{Q'} = Min_Q \ C_Q ∪ C_{Q'}, 

(NE4) If e_j is asymmetric and e'_0 is symmetric then 
Min_{Q'} = Min_Q \ C_Q ∪ ★(e'_0, e'_i). 

(NE5) If both e_j and e'_0 are asymmetric then 
Min_{Q'} = Min_Q \ C_Q ∪ {e'_i}. 

Proof: Note that {e_0, e_i, e_j} must be a star in 
Q. (NE3) ★(e_i, e_j) has type S4 or S5. After the transformation 
the two stars ★(e'_0, e'_i) and ★(e'_0, e'_j) have type S5. 
Both are optimal in Q'. (NE4) ★(e_i, e_j) has type S5. 
After the transformation ★(e'_0, e'_i) has type S5 and 
★(e'_0, e'_j) has type S3. Only the first is optimal in 
Q'. (NE5) ★(e_i, e_j) has type S5 while the other star in 
C_Q has type S3. After the transformation only e'_i 
remains symmetric in C_{Q'}, therefore it is the only optimal 
edge in C_{Q'}. 

Computing the optimal cost (f). Observe that from 
Lemmas 2-5 at least one optimal edge remains optimal 
after the NNI operation. Therefore, to compute the difference 
in costs between optimal rootings of Q and Q' 
we start with the cost analysis for the rootings of such an 
edge. 

First, we introduce a function for computing the cost 
differences. Consider three nodes x, y, z of some rooted 
gene tree such that x and y are siblings and the parent 
of them (denoted by xy) is a sibling of z. In other 
words we can denote this subtree by ((x, y), z). Then, 
the partial contribution of ((x, y), z) to the total 
weighted mutation cost can be described as follows: 
Σ_{s ∈ V_S} [α * (ξ^D(xy, s) + ξ^D(xyz, s)) + β * (ξ^L(xy, s) + ξ^L(xyz, s))]. 

Assume that x, y and z are mapped into a, b and c 
(from the species tree), respectively. It can be proved 
from the definition of ξ^D and ξ^L that the above contribution 
equals: φ(a, b, c) = α * (D(a, b) + D(a + b, c)) + β * 
(L(a, b) + L(a + b, c)). Now, assume that a single NNI 
operation changes ((x, y), z) into (x, (y, z)). It should be 
clear that the cost difference is given by: Δ_3(a, b, c) = φ(c, 
b, a) - φ(a, b, c). Similarly, we can define a cost difference 
when a single NNI operation changes ((x, y), (z, 
v)) into ((x, v), (y, z)). Assume that v is mapped into d. 
Then, the cost contribution of the first subtree is φ'(a, 
b, c, d) = φ(a, b, c + d) + α * D(c, d) + β * L(c, d). The 
cost difference is given by: Δ_4(a, b, c, d) = φ'(a, d, b, c) 
- φ'(a, b, c, d). 
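
The functions φ, φ', Δ_3 and Δ_4 translate directly into code. The sketch below assumes a toy species tree ((a, b), (c, d)), a naive lca, and the closed form L(a, b) = ρ(a, b) - 2 · (1 - D(a, b)); the names phi, phi4, delta3 and delta4 as well as the default weights α = β = 1 are choices made for this example only.

```python
# Toy species tree ((a, b), (c, d)) as a child -> parent map (illustrative assumption).
PARENT = {"a": "ab", "b": "ab", "c": "cd", "d": "cd",
          "ab": "abcd", "cd": "abcd", "abcd": None}

def ancestors(x):
    """Nodes on the path from x up to the root, x included."""
    path = []
    while x is not None:
        path.append(x)
        x = PARENT[x]
    return path

def lca(a, b):
    """The '+' operation in the species tree (naive version)."""
    on_path = set(ancestors(a))
    return next(x for x in ancestors(b) if x in on_path)

def D(a, b):
    """Comparability predicate."""
    return 1 if lca(a, b) in (a, b) else 0

def L(a, b):
    """Loss count via L(a, b) = rho(a, b) - 2 * (1 - D(a, b))."""
    top = lca(a, b)
    rho = ancestors(a).index(top) + ancestors(b).index(top)
    return rho - 2 * (1 - D(a, b))

def phi(a, b, c, alpha=1.0, beta=1.0):
    """Contribution of ((x, y), z) with x, y, z mapped to a, b, c."""
    ab = lca(a, b)
    return alpha * (D(a, b) + D(ab, c)) + beta * (L(a, b) + L(ab, c))

def delta3(a, b, c, alpha=1.0, beta=1.0):
    """Cost change when ((x, y), z) becomes (x, (y, z))."""
    return phi(c, b, a, alpha, beta) - phi(a, b, c, alpha, beta)

def phi4(a, b, c, d, alpha=1.0, beta=1.0):
    """Contribution of ((x, y), (z, v)) with mappings a, b, c, d."""
    return phi(a, b, lca(c, d), alpha, beta) + alpha * D(c, d) + beta * L(c, d)

def delta4(a, b, c, d, alpha=1.0, beta=1.0):
    """Cost change when ((x, y), (z, v)) becomes ((x, v), (y, z))."""
    return phi4(a, d, b, c, alpha, beta) - phi4(a, b, c, d, alpha, beta)
```

For example, with unit weights, turning the congruent tree ((a, b), (c, d)) into ((a, d), (b, c)) costs Δ_4(a, b, c, d) = 5 (one duplication and four losses).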

Lemma 6. If the center edge is optimal and remains 
optimal after the NNI operation then the cost difference 
equals Δ_4(a_1, a_2, a_3, a_4), where a_i (for i = 1, 2, 3, 4) is 
the mapping as indicated in Figure 3. 

As mentioned the above lemma can be proved by 
comparing the rootings placed on the center edges in Q 
and Q' . Lemma 6 gives a solution for cases: EQ2, EQ3, 
NE1 and NE3. The next lemma gives a solution for the 
remaining cases. 

Lemma 7. If for some i > 0 there exists an optimal edge 
in T_i ∪ {e_i} that remains optimal after the NNI operation 
(under the assumption that e_i is replaced by e'_i) then the cost 
difference is Δ_3(a_4, a_3, a_2) if i = 1, Δ_3(a_3, a_4, a_1) if i = 2, 
Δ_3(a_2, a_1, a_4) if i = 3 and Δ_3(a_1, a_2, a_3) if i = 4. 

Similarly to Lemma 6 we can prove Lemma 7 by comparing 
the rootings of e_i and e'_i. 

Error correction algorithm. Finally, we can present 
the algorithm for computing the optimal weighted 
mutation cost for a given gene tree and its k-NNI 
neighborhood. See Figure 4 for details. It should 
be clear that the complexity of this algorithm is 
O(|Q|^k + max(|Q|, |S|)). We say that a gene tree has 
errors if the optimal cost is attained by one of its 
NNI variants. Otherwise, we say that a gene tree 
does not require corrections. Please note that for the 
special case of k = 1, this algorithm is linear in time 
(see also our preliminary article [22]). 
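
To make the search problem concrete, the following brute-force sketch explores the k-NNI neighborhood of a small unrooted gene tree. It is explicitly not the algorithm of Figure 4: instead of the constant-time updates of Lemmas 2-7 it recomputes the cost of every rooting for every NNI variant, so it is far slower. The species tree ((a, b), (c, d)), the adjacency-list encoding (internal nodes are integers, leaves are species labels, each species occurring once) and all function names are assumptions made for this example.

```python
# Toy species tree ((a, b), (c, d)) as a child -> parent map (illustrative assumption).
PARENT = {"a": "ab", "b": "ab", "c": "cd", "d": "cd",
          "ab": "abcd", "cd": "abcd", "abcd": None}

def ancestors(x):
    path = []
    while x is not None:
        path.append(x)
        x = PARENT[x]
    return path

def lca(a, b):
    on_path = set(ancestors(a))
    return next(x for x in ancestors(b) if x in on_path)

def score(tree, alpha, beta):
    """Return (lca-mapping, weighted mutation cost) of a rooted nested-pair tree."""
    if isinstance(tree, str):
        return tree, 0.0
    (m1, c1), (m2, c2) = score(tree[0], alpha, beta), score(tree[1], alpha, beta)
    top = lca(m1, m2)
    dup = 1 if top in (m1, m2) else 0
    rho = ancestors(m1).index(top) + ancestors(m2).index(top)
    return top, c1 + c2 + alpha * dup + beta * (rho - 2 * (1 - dup))

def rooted(adj, node, parent):
    """Rooted subtree hanging below `node` when entered from `parent`."""
    if isinstance(node, str):               # leaves are species labels
        return node
    left, right = (rooted(adj, n, node) for n in adj[node] if n != parent)
    return (left, right)

def best_rooting_cost(adj, alpha, beta):
    """Minimum cost over all rootings (one rooting per undirected edge)."""
    edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    return min(score((rooted(adj, u, v), rooted(adj, v, u)), alpha, beta)[1]
               for u, v in map(tuple, edges))

def nni_variants(adj, u, v):
    """Yield both NNI neighbors across the internal edge {u, v}."""
    _keep, move = [n for n in adj[u] if n != v]
    for other in [n for n in adj[v] if n != u]:
        new = {n: list(ns) for n, ns in adj.items()}
        new[u][new[u].index(move)] = other   # swap the subtrees `move` and `other`
        new[v][new[v].index(other)] = move
        new[move][new[move].index(u)] = v
        new[other][new[other].index(v)] = u
        yield new

def knni_cost(adj, k, alpha=1.0, beta=1.0):
    """Optimal weighted cost over the tree and all trees within k NNI operations."""
    best = best_rooting_cost(adj, alpha, beta)
    if k == 0:
        return best
    internal = [(u, v) for u in adj for v in adj[u]
                if isinstance(u, int) and isinstance(v, int) and u < v]
    for u, v in internal:
        for variant in nni_variants(adj, u, v):
            best = min(best, knni_cost(variant, k - 1, alpha, beta))
    return best

# Unrooted gene tree ac|bd, which disagrees with the species tree.
GENE = {0: ["a", "c", 1], 1: ["b", "d", 0],
        "a": [0], "c": [0], "b": [1], "d": [1]}
```

In this toy case the uncorrected tree has optimal cost 5 (k = 0), while a single NNI operation yields a tree that is congruent with the species tree, so the cost drops to 0 for k = 1.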

General reconstruction problems 

We present several approaches to problems of error correction 
and phylogeny reconstruction. Let us assume 
that σ_{α,β,k}(S, Q) is the cost computed by the algorithm 
from Figure 4, where α, β > 0, k ≥ 0, S is a rooted species 
tree and Q is an unrooted gene tree. 

Problem 1 (kNNIC). Given a rooted species tree S 
and a set of unrooted gene trees G, compute the total 
cost Σ_{Q ∈ G} σ_{α,β,k}(S, Q). 

The kNNIC problem can be solved in polynomial time 
by an iterative application of our algorithm. Additionally, 
we can reconstruct the optimal rootings as well as 
the correct topology of each gene tree. Please note that 
for k = 0 (no error correction), we have the cost inference 
problem for the reconciliation of an unrooted gene 
tree with a rooted species tree [18]. 

Problem 2 (kNNIST). Given a set of unrooted gene 
trees G, find the species tree S that minimizes the total 
cost Σ_{Q ∈ G} σ_{α,β,k}(S, Q). 

The complexity of the kNNIST problem is unknown. 
However, similar problems for the duplication model 
are NP-hard [13]. Therefore we developed heuristics for 
the kNNIST problem and used them in our experiments. 

In applications there is typically no need to search over 
all NNI variants of a gene tree. For instance, a good candidate 
for an NNI operation is a weak edge. A weak edge 
is usually defined on the basis of its length, where short 
length indicates weakness. To formalize this property, let 
us assume that each edge in a gene tree Q has a length. 
We call an edge e in Q weak if the length of e is smaller 
than ω, where ω is a non-negative real. Now we can 
define variants of kNNIC and kNNIST denoted by ω-kNNIC 
and ω-kNNIST, respectively, where the NNI 
operations are performed on weak edges only. These 
straightforward definitions are omitted. Please note that 
the time complexity of the algorithm with NNIs limited 
to weak edges is O(l^k + max(|Q|, |S|)), where l is the 
number of weak edges in Q. 

1. Input: A species tree S, an unrooted gene tree Q, α, β > 0, k ≥ 0. 
2. Output: Optimal weighted cost for Q and its k-NNI neighborhood. 
3. Data preparation: compute the optimal weighted mutation cost σ, Min_Q, the lca structure for S and Q̂ 
   by the unrooted reconciliation algorithm. Let mincost := σ. 
4. for each sequence e^1 ... e^k of internal edges in Q do mincost := min(mincost, nnicost(e^1 ... e^k)). 
5. return mincost 
6. Procedure nnicost(e^1 e^2 ... e^j) 
7. if j = 0 return +∞ 
8. Transform Q into Q' and Q̂ into Q̂' (in situ). 
9. Update Min_Q according to cases NE1-NE5 and adjust the cost σ (Lemmas 6, 7). 
10. mincost := min(mincost, σ, nnicost(e^2 ... e^j)) 
11. Perform the reverse transformation to reconstruct Q, Q̂ and σ. 
12. Execute all steps 8-11 for the second NNI operation on e^1. 
13. return mincost 

Figure 4 Algorithm. Optimal weighted cost for Q and its k-NNI neighborhood. 

Software 

The unrooted reconciliation algorithm [18] and its data 
structures are implemented in the program URec [23]. Our 
algorithm partially depends on these data structures 
and therefore was implemented as a significantly 
extended version of URec. Additionally, we implemented 
a hill climbing heuristic to solve kNNIST and 
ω-kNNIST. 

Software and datasets from our experiments are made 
freely available through http://bioputer.mimuw.edu.pl/~gorecki/ec. 

Experimental results and discussion 
Data preparation 

First, we inferred 4133 unrooted gene trees with branch 
lengths from nine yeast genomes contained in the Genolevures 
3 data set [24], which contains protein 
sequences from the following nine yeast species: C. glabrata 
(4957 protein sequences, abbreviation CAGL), S. 
cerevisiae (5396, SACE), Z. rouxii (4840, ZYRO), S. kluyveri 
(5074, SAKL), K. thermotolerans (4933, KLTH), K. 
lactis (4851, KLLA), Y. lipolytica (4781, YALI), D. hansenii 
(5006, DEHA) and E. gossypii (4527, ERGO). 

We aligned the protein sequences of each gene family 
using the program TCoffee [25] with the default 
parameter setting. Then maximum likelihood (unrooted) 
gene trees were computed from the alignments by using 
proml from the phylip software package. The original 
species tree of these yeasts [24], here denoted by G3, is 
shown in Figure 5. 

Inferring optimal species trees 

The optimal species tree reconstructed with error corrections 
(1NNIST optimization problem) is depicted in 
Figure 5 and denoted by 1NNIEC. This tree differs from 
G3 in the rooting and in the middle clade with KLLA 
and ERGO. Additionally, we inferred by the heuristic an 
optimal species tree, denoted here by NOEC, with no 
error corrections (0NNIST optimization). All the trees 
from this figure are highly scored in each of the optimization 
schemas. 

From weak edges to species trees 

In the previous experiment, the NNI operations were 
performed on almost every gene tree in the optimal 
solution and with no restrictions on the edges. In order 
to reconstruct the trees more accurately, we performed 
experiments for ω-kNNIST optimization with various ω 
parameters and subsets of gene trees. The filtering of 
gene trees was determined by an integer μ > 0 that 
defines the maximum number of allowed weak edges in 
a single gene tree. Each gene tree that did not satisfy 
such condition was rejected. 
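
The selection of weak edges by the threshold ω and the filtering of gene trees by μ can be sketched as follows. The edge lengths and family names below are made-up toy values, not data from our experiments, and the function names weak_edges and filter_trees are hypothetical; each gene tree is reduced here to a dictionary of edge lengths.

```python
def weak_edges(edge_lengths, omega):
    """Edges whose length is smaller than the threshold omega."""
    return [e for e, length in edge_lengths.items() if length < omega]

def filter_trees(trees, omega, mu):
    """Keep only gene trees with at most mu weak edges."""
    return {name: lengths for name, lengths in trees.items()
            if len(weak_edges(lengths, omega)) <= mu}

# Toy input: two gene trees given only by their edge lengths (made-up values).
TREES = {
    "fam1": {"e1": 0.05, "e2": 0.30, "e3": 0.12},
    "fam2": {"e1": 0.02, "e2": 0.04, "e3": 0.09},
}

kept = filter_trees(TREES, omega=0.15, mu=2)
```

With ω = 0.15 and μ = 2 the first toy tree has two weak edges and is kept, while the second has three and is rejected.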

Figures 6 and 7 depict a summary of error correction 
experiments for weak edges. For each ω and μ we performed 
20 runs of the ω-kNNIST heuristic for finding 
the optimal species tree in the set of gene trees filtered 
by μ. The optimal species trees are depicted in the diagram, 
where each cell represents the result of a single 
ω-kNNIST experiment. We observed that G3, 1NNIEC 
and NOEC are significantly well represented in the set 
of optimal species trees in ω-1NNIST experiments, 
while in ω-2NNIST and ω-3NNIST experiments only 
G3 and NOEC were detected. Note that the original 
yeast phylogeny (G3, black squares in Figures 6 and 7) 
is inferred for ω = 0.1-0.2 (in other words approx. 30-40% 
of edges are weak, see Figure 8) and μ > 10 in most 
experiments. In particular for ω = 0.15 and μ = 10, 364 
gene trees were rejected (see Figure 9). These results 
significantly support the G3 phylogeny. Please note that 
the results for the standard unrooted reconciliation 
algorithms without error correction are located in the 
first column of diagrams (ω = 0). 

Figure 5 Yeasts phylogeny. Species tree topologies. G3 - original phylogeny of Genolevures 3 data set [24]. 1NNIEC - optimal rooted species 
tree inferred from gene trees with all possible 1-NNI error corrections. NOEC - optimal species tree for the yeast gene trees with no NNI 
operations (cost 64413, no corrections). Rank denotes a position of a tree on the sorted list of the best trees. The trees below are inferred from 
other ω-kNNIST experiments (see next figures). Please note that NOEC, G3, a1 and a2 are rooted variants of the same unrooted tree. A similar property holds for 
1NNIEC, b1 and b2. 

From trusted species tree to weak edges in gene trees - 
automated and manual curation 

Assume that the set of unrooted gene trees and the
rooted (trusted) species tree S are given. Then we can
state the following problem: find ω and μ such that S
is the optimal species tree in the ω-kNNIST problem for
the set of gene trees filtered by μ. For instance, in our
dataset, if we assume that G3 is a given correct phylogeny
of yeasts, then from the diagrams (Figures 6 and 7) one
can determine appropriate values of ω and μ that yield G3
as optimal. In other words, we can automatically
determine weak edges by ω and filter gene trees by μ. This
approach can be applied in tree curation procedures to
correct errors in an automated way as well as to find
candidates (rejected trees) for further manual curation.
For instance, in the previous case, when ω = 0.1 and μ
= 10, we have 3164 trees that can be corrected and
rooted by our algorithm, while the 364 rejected trees
could be candidates for further manual correction.
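
The parameter search stated above can be sketched as a simple grid scan. This is only an illustration under stated assumptions: weak edges are taken to be edges with branch length strictly below ω, each gene tree is reduced to a list of edge lengths, and the species tree inference is supplied by the caller as the hypothetical function infer_species_tree, standing in for an ω-kNNIST heuristic that is not implemented here.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

# Illustrative assumption: a gene tree is reduced to the lengths of its edges.
GeneTree = List[float]


def find_parameters(
    trees: Sequence[GeneTree],
    trusted_tree: Hashable,
    omegas: Sequence[float],
    mus: Sequence[int],
    infer_species_tree: Callable[[List[GeneTree], float], Hashable],
) -> List[Tuple[float, int, int]]:
    """Return (omega, mu, number_of_rejected_trees) for every parameter pair
    whose inferred species tree equals the trusted tree.

    infer_species_tree is a hypothetical stand-in for the species tree
    heuristic; it receives the accepted gene trees and omega.
    """
    hits: List[Tuple[float, int, int]] = []
    for omega in omegas:
        for mu in mus:
            # Keep trees with at most mu weak edges (assumed: length < omega).
            accepted = [
                tree
                for tree in trees
                if sum(1 for length in tree if length < omega) <= mu
            ]
            if not accepted:
                continue  # nothing left to infer a species tree from
            if infer_species_tree(accepted, omega) == trusted_tree:
                hits.append((omega, mu, len(trees) - len(accepted)))
    return hits
```

Each returned triple also reports how many gene trees were rejected, which corresponds to the set of candidates for manual curation.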

Discussion 

We present novel theoretical and practical results on the
problem of error correction and phylogeny
reconstruction. In particular, we describe a polynomial
time and space algorithm that simultaneously solves the
problem of correcting topological errors in unrooted
gene trees and the problem of rooting unrooted gene
trees. The algorithm allows us to efficiently perform
experiments on truly large-scale datasets available for
yeast genomes. Our experiments suggest that our
algorithm can be used to (i) detect errors, (ii) infer a
correct phylogeny of species in the presence of weak
edges in gene trees, and (iii) help in tree curation
procedures.

Conclusion 

We introduced a novel polynomial time algorithm for
error-corrected and unrooted gene tree reconciliation.
Experiments on yeast genomes suggest that an
implementation of our algorithm can greatly improve the
accuracy of gene tree reconciliation, and thus curate
error-prone gene trees. Moreover, we use our
error-corrected reconciliation to make the gene duplication
problem, a standard application of gene tree reconciliation,
more robust. We conjecture that the error-corrected
gene duplication problem is intrinsically hard to solve,
since the gene duplication problem is already NP-hard.
Therefore, we introduced an effective heuristic for the
error-corrected gene duplication problem. Our
experimental results for a wide range of error-correction tests
on the yeast phylogeny show that our error-corrected
reconciliations result in improved predictions of invoked
gene duplication and loss events, which in turn allow the
inference of more accurate phylogenies.

The presented error correction is based on gene-species
tree reconciliation using gene duplication and loss.
However, there are other major evolutionary mechanisms
that yield gene tree topologies that are inconsistent with
the actual species tree topology, like horizontal gene
transfer and deep coalescence. Gene tree reconciliation
using these mechanisms is highly sensitive to topological
error, similar to gene tree reconciliation under gene
duplication and loss. Future work will focus on the
development of algorithms that can also reconcile
unrooted and erroneous gene trees using horizontal
gene transfer and deep coalescence.